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Abstract 

Summary: Both theory and empirical evidence indicate that phy- 
logenies (trees) of different genes (loci) do not display precisely matched 
topologies. This phylogenetic incongruence is attributed to the reticu¬ 
lated evolutionary history of most species due to meiotic sexual recom¬ 
bination in eukaryotes, or horizontal transfers of genetic materials in 
prokaryotes. Nonetheless, most genes do display topologically related 
pliylogenies; this implies they form cohesive subsets (clusters). In this 
work, we compare popular clustering methods, and show how the per¬ 
formance of the normalized cut framework is efficient and statistically 
accurate when obtaining clusters on the set of gene trees based on 
the geodesic distance between them over the Billera-Holmes-Vogtmann 
(BHV) tree space. We proceed to present a computational study on the 
performance of different clustering methods with and without prepro¬ 
cessing under different distance metrics and using a series of dimension 
reduction techniques. 

Results: First, we show using simulated data that indeed the Ncut 
framework accurately clusters the set of gene trees given a species tree 
under the coalescent process. We then depict the success of our frame¬ 
work by comparing its performance to other clustering techniques, 
including k-means and hierarchical clustering. The main computa¬ 
tional results can be summarized to the stellar performance of the 
Ncut framework even without dimension reduction, the similar perfor¬ 
mance portrayed by k-means and Ncut under most dimension reduc¬ 
tion schemes, the utter failure of hierarchical clustering to accurately 
capture clusters, as well as the significantly better performance of the 
NJp method, as compared to MLE. 
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1 Introduction 


During especially this last decade, the field of phylogenetics has been un¬ 
dergoing a gradual paradigm shift away from the notion of the strictly bi¬ 
furcating, completely resolved species trees, to a recognition that species 
are containers of allelic variation for each gene. It is very well established 
that differences in lineage sorting due to genetic drift lead to differences 
in phylogenetic tree topologies (Maddison, 1997). Gene flow in ancestral 


populations and independent lineage sorting of polymorphisms is fully ex¬ 
pected to generate topological conflicts between gene trees in reticulating 
(e.g., sexually recombining) species (|Huson et al. 2005 Weisrock et al. 


2006, Taylor et al, 2000). Both extant and ancestral species could exhibit 


this phenomenon, so ancestral species should not be regarded as node points 
in a fully resolved bifurcating tree, but instead can be thought of as spa- 
tiotemporal clouds of individual genotypes with all their inherent allelism. 
Thus, a central issue in systematic biology is the reconstruction of popu¬ 
lations and species from numerous gene trees with varying levels of discor¬ 
dance (Brito and Edwards, 2009 Edwards 2009). While there has been a 


well-established understanding of the discordant phylogenetic relationships 
that can exist among independent gene trees drawn from a common species 


tree ( 

Parnilo and Nei, 

1988 

Takahata 

1989 

Maddison, 

1997; 

Bollback and 

Huelsenbeck 



), phylogenetic studies have only recently begun to shift 


away from single gene or concatenated gene estimates of phylogeny towards 


these multi-locus approaches (e.g. 

Carling and Brumfield 

00 

o 

o 

CM 

Yu et al. 

(2011); Betancur et al. 

(2013); Heled and Drummond 

(2011) 

; Thompson and 


Kubatko (2013)). 


There exist numerous processes that can reduce the correlation among 
gene trees. Negative or balancing selection on a particular locus is expected 
to increase the probability that ancestral gene copies are maintained through 


speciation events ( 

Takahata and Nei, 

1990). Horizontal transfer shuffles di- 

vergent genes among different species ( 

Maddison, 

1997 

). Correlation may 


also be reduced by naive sampling of loci for analysis. For example, paral- 
ogous gene copies will result in a gene tree that conflates gene duplication 
with speciation. Similarly, sampled sequence data that span one or more 
recombination events will yield “gene trees” that are hybrids of two or more 
genealogical histories (Posada and Crandall, 2002). These non-coalescent 


processes can strongly influence phylogenetic inference (Posada and Cran- 


dall 

2002, 

Martin and Burg 

et al. 

(1998 

) showed that an s 


2002 Edwards, 2009). In addition, Rivera 


sive prokaryotic gene transfer (or transfers) preceding the formation of the 
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eukaryotic cell, arguing that there is significant genomic evidence for more 
than one distinct class of genes. These examples suggest that the distribu¬ 
tion of eukaryotic gene trees may be more accurately modeled as a mixture 
of a number of more fundamental distributions. In order to find a mixture 
structure in distributions of gene trees, we first need to find how many com¬ 
ponents of distributions there are in the mixture. This is the main reason 
why in this work we focus on the problem of clustering gene trees over the 
“tree space”. 

Many researchers take an approach to apply a likelihood based method, 
such as the maximum likelihood estimator (MLE) or Bayesian inference on 
the concatenated alignment from gene alignments in order to reconstruct the 
species tree. However, Roch and Steel (2015) showed that if we apply a like¬ 
lihood based method on the concatenated alignment from gene alignments, 
then the resulting trees might be statistically inconsistent because some gene 
trees are significantly incongruent from the species tree due to incomplete 
lineage sorting, horizontal gene transfer, among other reasons. More pre¬ 
cisely, they showed that under the multi-species coalescent with a standard 
site substitution model, such as the general time reversible (GTR) model 
(Tavare 1986) etc, the MLE on a sequence data concatenated across genes 
under the assumption that all sites have evolved independently and identi¬ 
cally on a fixed tree is a statistically inconsistent estimator of the species 
tree. 

Typically, statistical analysis on phylogenetic trees is conducted by map¬ 
ping each tree to a vector in M rf , d G N: this is referred to as a dissimilarity 
map. Given any tree T of n leaves with branch length information, one 
may produce a corresponding distance matrix , D(T). This distance matrix 
is an n x n symmetric matrix of non-negative real numbers, with elements 
corresponding to the sum of the branch lengths between pairs of leaves in 
the tree. To calculate Dnj\(T ), one simply determines which edges of the 
tree form the path from a leaf i to a leaf j, and then sums the lengths of 
these branches. Since D(T ) is symmetric and has zeros on the diagonal, the 
upper-triangular portion of the matrix contains all of the unique information 
found in the matrix. We can vectorize a tree T by enumerating this unique 
portion of the distance matrix, 


v d (T) := (Di 2 (T), D 13 (T ),..., D 23 (T ),..., £> n _i, n (T)) 

which is called the dissimilarity map of a tree T and is a vector in M^a). 

However, the space of phylogenetic trees with n leaves is not a Euclidean 
space. In fact, it is represented as the union of lower dimensional polyhedral 
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Billera et al. 


(2001) introduced a continuous space which 


cones m 

explicitly models the set of rooted phylogenetic trees with edge lengths on 
a fixed set of leaves. Although the Billera-Holmes-Vogtmann (BHV) tree 
space is not Euclidean, it is non-positively curved, and thus any two points 
are connected by a unique shortest path through the space, called a geodesic. 
In this computational study, we show, among other things, that using the 
BHV tree space can help produce more statistically accurate results. 

This paper focuses on such tree spaces and presents a computational 
study on multi-loci phylogenetic analysis using gene tree clustering over the 
BHV tree space. More specifically, this work presents the differences and 
the performance of a normalized cut framework, fc-means, and hierarchical 
clustering in the Euclidean and the BHV tree space, and under different 
dimension reduction approaches using simulated datasets. The paper is 
organized as follows. Section [2] offers a basic review of the BHV space, the 
normalized cut framework, and the dimension reduction for the interested 
reader. In Section [3j we present our computational study and the results we 
obtained using simulated datasets. Moreover, Section [4] discusses the results 
and focuses on our main computational results, while also summarizing our 
work and offering insight for possible future directions. 


2 Fundamentals 


In this paper, we present a comparative study of different methods for multi¬ 
loci phylogenetic analysis using gene tree clustering based on the distance 
matrix obtained by the geodesic distances between two trees over the BHV 
space. The methods we compare are the normalized cut framework, based on 
the seminal contribution by Shi and Malik ( |2000 ), A:-means (e.g., Schenker 
et al. (2003)), and hierarchical clustering (the interested reader is referred 


Maimon and Rokach (2005) for an excellent overview). 


to 

Furthermore, we investigate how dimension reduction methods can be 
applied in order to extract a lower dimensional structure before clustering 
and whether that affects the quality of solutions compared to applying our 
clustering methods directly upon the original distance matrix. This reduc¬ 
tion also helps with the visualization of the data. For dimension reduction, 
kernel principal component analysis (KPCA Scholkopf et al. (1998])) and 
t-stochastic neighborhood embedding (t-SNE van der Maaten and Hinton 


(2008)) are employed among many other methods. Hereafter, we refer to 
the above three branches as direct, KPCA, and t-SNE, respectively. 

We now proceed to offer some basics on the BHV tree space, the nor- 
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malized cut problem, and the different dimension reduction techniques used 
herein. 


2.1 Billera-Holmes-Vogtmann Tree Space 


Billera et al. (2001) introduced a continuous space that models the set of 
rooted phylogenetic trees with edge lengths on a fixed set of leaves. (Un¬ 
rooted trees can be accommodated by using either the Ferras transform, or 
by designating an arbitrary leaf node as the root.) It is known that in the 
Billera-Holmes-Vogtmann (BHV) tree space any two points are connected 
by a geodesic, and the distance between two trees is defined as the length of 
the geodesic connecting them. 

Consider a rooted tree with n leaves. Such a tree has at most 2n — 2 
edges; there are n terminal edges, which are connected to leaves, and as 
many as n — 2 internal edges. The maximum number of edges is achieved 
when the tree is binary, but the number of edges can be lower if the tree 
contains any polytomies. With each distinct tree topology, we associate a 
Euclidean orthant of dimension equal to the number of edges that the topol¬ 
ogy possesses. (Here, we may regard an orthant to be the subset of with 
all coordinates non-negative.) For each topology, the orthant coordinates 
correspond to edge lengths in the tree. 

Since all tree topologies have the same set of n terminal leaves, and 
each of these leaves is associated with a single terminal edge, the orthant 
coordinates associated with the terminal edges are of less interest than those 
of internal nodes. As a result, we will simplify our discussion by ignoring 
the terminal edge lengths, and concern ourselves primarily with the portion 
of each orthant which describes the internal edges. (Recall that this space 
has at most n — 2 dimensions.) 

Since each of the coordinates in a simplified orthant corresponds to an 
internal edge length, the orthant boundaries (where at least one coordinate 
is zero) represent trees with collapsed internal edges. These points can be 
thought of as trees with slightly different—but closely related—topologies. 
The BHV space is constructed by noting that the boundary trees from two 
different orthants may describe the same polytomic topology. With this in¬ 
sight, we may set about constructing the space by grafting orthant bound¬ 
aries together when the trees they represent coincide. 

Since each orthant is locally a Euclidean space, the shortest path between 
two points within a single orthant is a straight line. The difficulty comes 
in establishing which sequence of orthants joining the two topologies will 
contain the geodesic. In the case of four leaves, we could do this through a 
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brute-force search, but we cannot hope to do so with larger trees. Owen and] 
Provan (2011) present a quartic-time algorithm (in the number of leaves n) 
for finding the geodesic path between any two points in the space. Once the 
geodesic is known, computing its length—and thus the distance between the 
trees—is a simple matter. 


2.2 Clustering 

Given a set of gene trees for the species in analysis, a clustering algorithm 
is applied based on the distance matrix containing the geodesic distances in 
the BHV tree space. As an alternative, dimension reduction may be applied 
before the clustering when directly applying the clustering techniques proves 
unfruitful. This is also helpful for visualization purposes. For the details 
of the clustering and dimension reduction methods, see the Supplementary 
Material. 

There are many standard clustering methods raging from non-hierarchical 
clustering such as k-means to hierarchical clustering methods. Among oth¬ 
ers, this paper considers three methods: normalized cut, k-means, and hi¬ 
erarchical clustering (average linkage). The k-means is the most standard 
non-hierarchical clustering method, which has been used in a large number 
of applications. Note, however, that with BHV geodesic distance the k- 
rnean methods is not computationally feasible, because update of centroids 
required in the methods is too expensive. As a linkage method for hierarchi¬ 
cal clustering we use average linkage, since it is applied to general distance 
or dissimilarity measures. 

From the clustering methods presented in this computational study, the 
normalized cut framework (Shi and Malik, 2000) is applied for the first 


time in phylogenetics, to the best of our knowledge. Normalized cut can be 
employed for clustering using only a similarity or dissimilarity matrix; we 
can observe that the coordinates of the original data points are not necessary. 
To properly apply the normalized cut framework in a clustering setting the 
only required input is the set of data points (each represented by a node in 
an undirected graph) forming node set V, and a set of weights of similarity 
between them (the edge set, E. of the graph). Then, the normalized cut 
framework aims to detect a bipartition of the node set of the graph in two 
node sets, (£, S), such that 0 is minimized with S U S = V and S n S = 0. 


NCut{S, S) 


cut(S, S ) cut(S, S ) 
assoc(S,V ) assoc(S, V) 


(1) 
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More recently, the problem has been studied by Hochbaum (2010) and 


Hochbaum (2013), where normalized cut variants are discussed, with some 
of them being shown to be solvable in polynomial time. Among them, of 
interest to the clustering community would be the “normalized cut” problem 
of Sharon et al. (2006), which is nothing more but a single version of the 


original normalized cut criterion shown in 0 and the ratio regions problem 
(Cox et al., 1996). In Hochbaum (2010) both the ratio regions and “normal¬ 
ized cut” problems were shown to be poly-time solvable. The normalized 
cut framework has been successfully applied to numerous applications, in- 


eluding image segmentation (S 

hi and Malik 

2000, 

Carballido-Gamio et al. 

2004; Yao et al., 2012), biology 

(Xing and Karp, 

2001, Higham et al. | 2007) 

and social networks ( 

Newman 

2013 

)• 


The normalized cut is known to be solved approximately (with typically 
good performance) as a generalized eigenproblem, which admits a straight¬ 
forward and easy to implement solution. In our experiments, for simplicity, 


the similarity Wij is given by Wij = exp(—where Dij is the distance 
matrix, and o = 1.2 x Median{Djj | i j}. 


2.3 Dimension reduction 


As an optional procedure before clustering, a low dimensional expression 
of gene trees may be extracted from the distance matrix. Among various 
dimension reduction methods, kernel principal component analysis (KPCA, 


Scholkopf et al., 

1998 

) and t-stochastic neighborhood embedding (t-SNE, 

van der Maaten and Hinton 

2008 

) are chosen for our analysis by preliminary 


experiments (we applied also spectral methods and Isomap, but the results 
were less favorable than KPCA and t-SNE.) Those methods extracted three 
dimensional expression of data, when applied, and the Ncut was applied to 
the Euclidean distance matrix among the three dimensional data points. 

KPCA is a nonlinear extension of the standard principal component anal¬ 
ysis (PCA); it applies PCA to feature vectors, which are given by nonlinear 
mapping of the original data to a feature space. The nonlinear map is de¬ 
fined by a positive definite kernel , and the feature space is a possibly infinite 
dimensional Hilbert space provided implicitly by the positive definite ker¬ 
nel. KPCA gives nonlinear functions j\,... of data points (Xi)fL 1 so that 
(fi(Xi ),..., fd{Xi))f =1 can serve as a d-dimensional representation of data. 
The analysis of this paper uses Gaussian kernel k(Xi,Xj ) = exp(— 
where D VJ is the distance matrix of the gene trcc^J 

1 While this kernel with an arbitrary distance matrix D is not necessarily positive 
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t-SNE is a method for low-dimensional expression or visualization of 
high-dimensional data; it typically extracts two or three dimensional ex¬ 
pression. Given in a high-dimensional space, t-SNE first computes 

a probability based on the distance matrix so that a high probability im¬ 
plies similarity of and Xj. The method then provides a low dimensional 
expression in such a way that a probability q tJ defined similarly for 

a pair (' Yi,Yj ) is close to ( ptj ). The points are found with numerical 

optimization to minimize the Kullback-Leibler divergence between (pij) and 
(q t j). In our experiments, a Matlab implementation by van der Maaten (lvd- 
maaten.github.io/tsne/) is used. The perplexity parameter, which gives a 
way of determining local bandwidth parameters, is set 30 in our experiments. 


3 Results 


We conducted numerical experiments with simulated datasets and a genome 
dataset. All our simulated datasets were generated using the software Mesquite 
(Maddison and Maddison, 2009). We first demonstrate that the normalized 
cut framework accurately clusters the set of gene trees given by a species 
tree under the coalescent process. Then, we proceed to compare different 
dimension reduction schemes and their performance as compared to cluster¬ 
ing via normalized cut directly on the original tree space. Last, we compare 
L-means and hierarchical clustering to our proposed approach. Two main 
observations throughout our obtained results are that hierarchical cluster¬ 
ing is not effective in recognizing clusters (as opposed to normalized cut 
and A:-means), and that the frameworks perform better on the gene trees 
reconstructed via the neighbor-joining (NJ) method (Saitou and Nei, 1987) 
than those reconstructed via the MLE under evolutionary models. 

The experimental design for the genome dataset in Subsection 3.2 is 
as follows. The three clustering methods were first applied to a genome¬ 
wide dataset on coelacanths, lungfishes, and tetrapods from Amemiya et al. 


(2013); Liang et al. (2013), where it was observed that there were two reli¬ 
able clusters in their 1290 genes. Based on the datasets, we reconstructed 
the consensus trees using NJ trees with bootstrap confidence for the clusters 
> 0.95 (see Subsection 3.2 for more details). We performed numerical ex¬ 
periments using the (typically used) Euclidean space and the (better suited) 
BHV tree space and compared the accuracy of the two spaces for the goal 
of recognizing more statistically accurate clusters. Last, we compared dif- 


definite, in our analysis the Gram matrices k(Xi,Xj ) made by the given data were positive 
definite. 
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ferent preprocessing dimension reduction schemes for each and every one of 
the spaces, and the clustering techniques. 

Overall, we obtained consistent results with both the normalized cut and 
the /c-means frameworks on the consensus trees obtained. The consensus 
tree from one cluster (of 858 gene trees with the direct application of the 
normalized cut, of 761 gene trees with the normalized cut after applying 
KPCA, and of 817 gene trees with t-NSE normalized cut) supports the view 


of Fritzsch (1987); Gorr et al. (1991) that claims that coelacanths are most 


closely related to the tetrapods; furthermore, the consensus tree constructed 
from the other cluster (of 322 gene trees with the direct Ncut algorithm, 
of 320 gene trees with the KPCA Ncut algorithm, and of 463 gene trees 
with the t-NSE Ncut) supports the view of Takezaki et al. (2004), that 


is, the coelacanth, lungfish, and tetrapod lineages diverged within a very 
short time interval and that their relationships may represent an irresolvable 
trichotomy. We now proceed to describe our results in more detail on both 


tetrapods (see Subsection 3.2). 


simulated datasets (see Subsection 3.1) and coelacanths, lungfishes, and 


3.1 Simulated data sets 

The simulated data is generated as follows. We have fixed the popula¬ 
tion size N e = 10,000 and we set the species depth c • N e where c = 
0.6, 0.7,0.8, 0.9,1,1.2,1.4,1.6,1.8, 2. Then for each species depth c • N e , we 
generated 100 species trees from the Yule process and we picked randomly 
two trees from them. With each species tree, we generated 1000 random 
gene trees under the coalescent process within the species tree using the soft- 


ware Mesquite ( 

Maddison and Maddison, 2009 

). To generate the sequences 

we have used the software PAML ( 

Yang 

1997 

) under the Jukes-Cantor (JC) 


(Jukes and Cantor, 1969) +T model with k = 4.0, and the number of cate¬ 


gories of the discrete gamma model is 1 with a = 1.0 for one set of gene trees 
under the first species tree and we generated sequences under the GTR +T 
model with k = 4.0, and the number of categories of the discrete gamma 
model is 1 with a = 1.0 for the other set of gene trees under the second 
species tree. The frequencies for T, C, A, and G in the data are set as 
0.15, 0.35, 0.15, 0.35, respectively. We set the length of sequences as 500. 
To reconstruct trees from these DNA sequences, we used the NJ algorithm 


with the p-distance (Saitou and Nei, 1987) (we call NJp method from here 


on) to reconstruct the NJ trees, and used the software PHYML (Guindon and 


Gascuel, 2003) to reconstruct MLE trees under the GTR model (Felsenstein 


1981), the Hasegawa-Kishino-Yano (HKY) (Hasegawa et al . 1985) model, 
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-®— NJp KPCA 
-■©•■-NJp, t-SNE 
-■©—NJp, direct 
—GTR, KPCA 
-O-GTR, t-SNE 
GTR, direct 
—®— HKY, KPCA 
-© •HKY, t-SNE 
--e-HKY, direct 
-®— K80, KPCA 
K80, t-SNE 
— ©-K80, direct 
-©— JC, KPCA 

•■■■©••« JC, t . SNE 

—e— JC, direct 


Figure 1: Ncut Clustering accuracy for simulated data. NJp gives superior 
accuracy than MLE. The results of MLE show three groups depending on 
the three clustering methods. 


and the Kimura 2 parameter (K80) model (Kimura 1980); they are denoted 
by MLE-GTR, MLE-HKY, and MLE-K80, respectively, from now on. 

Fig. a shows the rates of correctly clustered genes by the three newly 
proposed clustering schemes: direct Ncut, KPCA Ncut, and t-SNE Ncut. 
Here and in the sequel, direct means direct application of a clustering meth¬ 
ods without dimension reduction. Generally the accuracy is higher for larger 
species depths, which imply clearer separation. There is a significant dif¬ 
ference of accuracy between NJp and MLE tree reconstruction methods; 
the NJp method (solid lines) gives better clustering for all the three clus¬ 
tering methods. It is also noted that the accuracy for the MLEs has clear 
groups based on the clustering schemes; t-SNE Ncut (broken lines), direct 
Ncut (dashed lines), and KPCA Ncut (dotted lines) give groups of similar 
accuracy levels in this order. 

To show the advantage of using the BHY tree space over the Euclidean 
space, we applied the same clustering methods to Euclidean distance ma¬ 


trices D(T) in Mis/, and compared the clustering accuracy obtained. The 
differences (which are easily noted) are shown in Table [lj however we focus 
only on depth ratios c = 0.8 and 1.2 for ease of presentation; the remaining 
depths also portray the same differences and can be found in the Supple¬ 
mentary Material. We observe that in most cases, the BHV tree space 
gives better clustering accuracy than when using Euclidean distances. Even 
though the Euclidean distance of MLE-HKY with KPCA and t-SNE for 
c = 0.8 and direct for c = 1.2 gives more accurate clustering, these results 
are much lower than the ones obtained by NJp. 
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-e— NJp KPCA 
-O-NJp, t-SNE 
—GTR, KPCA 
...0-.-GTR, t-SNE 
— HKY, KPCA 
-0--HKY, t-SNE 
—©— K80, KPCA 
-■» K80, t-SNE 
-o— JC, KPCA 
•■•■»•••• JC, t-SNE 


Figure 2: £>means Clustering accuracy for simulated data. It performs 
similarly to the normalized cut framework; the main difference is the lack 
of a direct application of fe-means on the datasets. 


To show the performance of the normalized cut compared to other stan¬ 
dard clustering methods in the field, we performed the same experiments 
using the well-known fc-means and hierarchical clustering. The k-means 
clustering is infeasible for BHV tree space, but instead it can be applied 
with a dimension reduction scheme before putting them to the test. As 
such, there is no direct application in the results shown in Fig. [2j 

From the Figures, we can see that k -means is indeed a viable option 
for accurately clustering the trees, performing similarly to our proposed 
normalized cut framework. On the other hand, hierarchical clustering has 
proven to be very bad in clustering in this context, resulting in significantly 
imbalanced clusters. We tried other linkage methods, but the results were 
similar. From our computational study, it can be concluded that normalized 
cut is very effective in reproducing the cluster structure in gene trees when 
using BHV distances; moreover, the NJp method is superior in recognizing 
clusters when compared to MLE methods. Hierarchical clustering, on the 
other hand, is not recommended in this context. Last, a main advantage of 
the normalized cut framework is that it requires no dimension reduction, but 
instead can be directly applied on the BHV space. Computational results of 
clustering accuracy for k -means are also presented in Table [2j the results for 
hierarchical clustering, as it performs poorly, are omitted but are provided 
for completeness in the Supplementary Material. 
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-®— NJp KPCA 
-■■©■•-NJp, t-SNE 
-■©—NJp, direct 

- ® -GTR, KPCA 
-e—GTR, t-SNE 
-O-GTR, direct 
-o-HKY, KPCA 

- © - HKY, t-SNE 
—«—HKY, direct 
■••©•••• K80. KPCA 
—e- K80, t-SNE 

- © - K80, direct 
-e— JC, KPCA 
-o JC, t-SNE 
—©—JC, direct 


Figure 3: Hierarchical Clustering accuracy for simulated data. It is easy to 
see that the results obtained are statistically far worse than the other two 
main clustering techniques. 



KPCA 

NJp 

t-SNE 

direct 

KPCA 

MLE-GTR 

t-SNE 

direct 

BHV 

Euclid 

0.838 

0.840 

0.824 

0.860 

0.836 

0.750 

0.522 

0.740 

0.674 

0.826 

0.620 

0.740 


KPCA 

NJp 

t-SNE 

direct 

KPCA 

MLE-GTR 

t-SNE 

direct 

BHV 

Euclid 

0.988 

0.958 

0.976 

0.962 

0.982 

0.954 

0.872 

0.828 

0.944 

0.936 

0.738 

0.828 


Table 1: Comparison of clustering accuracy between BHV space and Eu¬ 
clidean space when using normalized cut. Euclidean distance gives worse 
results than geodesic distances in the BHV tree space. BHV geodesic dis¬ 
tance with NJp tree construction is the most suitable for clustering. 


3.2 Genome data set on coelacanths, lungfishes, and tetra- 
pod 

On top of the simulated datasets, we have also applied the clustering meth¬ 
ods to the dataset comprising 1,290 nuclear genes encoding 690,838 amino 
acid residues obtained from genome and transcriptome data by|Liang et al. 
(2013). Over the last decades, the phylogenetic relations between coela¬ 


canths, lungfishes, and tetrapods have been controversial despite the exis¬ 
tence of many studies on them (Hedges, 2009). Most morphological and 


paleontological studies support the hypothesis that the lungfishes are closer 


to the tetrapods than coelacanths (Tree 1 in Figure 1 from Liang et al. 


(2013)), however, there exists research in the field that supports the hy¬ 
pothesis that the coelacanths are closer to the tetrapods (Tree 2 in Figure 
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(a) c = 0.8 



KPCA 

NJp 

t-SNE 

direct 

KPCA 

MLE-GTR 

t-SNE 

direct 

BHV 

Euclid 

0.838 

0.842 

0.806 

0.840 

N/A 

0.830 

0.618 

0.740 

0.648 

0.826 

N/A 

0.740 


KPCA 

NJp 

t-SNE 

direct 

KPCA 

MLE-GTR 

t-SNE 

direct 

BHV 

Euclid 

0.992 

0.956 

0.984 

0.962 

N/A 

0.956 

0.742 

0.828 

0.866 

0.936 

N/A 

0.828 


Table 2: Comparison of clustering accuracy between BHV space and Eu¬ 
clidean space when using k-means. We observe that the normalized cut 
framework and £;-means both perform well. There is no direct application 
of k -means on the BHV space, which is a main advantage of the normalized 
cut. 


1 from Liang et al. (2013)). Others support the hypothesis that the coela- 


canths and the lungfishes form a sister clades (Tree 3 in Figure 1 from Liang 


et al. (2013)) or tetrapodes, lungfishes, and coelacanths cannot be resolved 


(Tree 4 in Figure 1 from Liang et al. (2013)). In this subsection, we apply 
the normalized cut framework for clustering to the genome data set from 


Liang et al. (2013) and analyze each obtained cluster. 


We applied the clustering methods (with and without a dimension reduc¬ 
tion) to the distance matrix computed from the set of gene trees constructed 
by the NJp method. The number of clusters in Ncut was set to two, that is, a 
bipartition. Fig.[4]shows the clustering results with KPCA and t-SNE, plot¬ 
ted on the three dimensional space found by the dimension reduction. The 
red and blue colors show the two clusters, where the color density represents 
the bootstrap confidence explained below. 

To evaluate the stability of clustering, we computed a bootstrap con¬ 
fidence probability for each gene. Namely, given N X N distance matrix 
(. Dij ) as an input to the Ncut framework, we generated random resam¬ 
pling {i ±,..., *tv} from {1,..., N} with replacement, and applied Ncut to 
(D ia iX b =v We repeated this procedure 100 times with independent ran¬ 
dom indices, and computed the ratio that a gene is classified in the same 
cluster as the one given by (Dij). 

We computed the bootstrap confidence for all 1,290 genes. The cumula¬ 
tive distribution functions of these values are shown for the tree clustering 
methods in Fig. [5] (left). The ratio of genes with confidence above 0.95 is 
91.4%, 83.8%, and 99.2% for direct Ncut, KPCA Ncut, and t-SNE Ncut, 
respectively. For comparison, we computed the bootstrap confidence for 
Ncut with three clusters. Fig. [5] (right) shows the cumulative distribution 


13 





























Ncut clustering (KPCA) 



Ncut clustering (t-SNE) 




Figure 4: Clustering of the genome data set. The two clusters are depicted 
in red and blue with bootstrap confidence shown by color density. 


function, which clearly reveals that three clusters are unstable. From these 
observations, we see that the two clusters obtained by the methods are not 
artifacts but a stable structure in the genome data. 

The clusters obtained by the three methods look different in their shapes. 
We then examined agreements of the clusters at the gene level. After ex¬ 
tracting the genes with bootstrap confidence not less than TH, (TH = 0.90 
or 0.95), we evaluated the agreement of methods A and B by 

icjnc^l + |cSnc|| 
tAB •“- W A -’ 

where Na is the number of genes by Method A with confidence larger than 
TH and C\ is the 2 -th (i = 1, 2) cluster by Method A (Na = \C\\ + |C^|). 
We identified which cluster in A corresponds to a cluster B by the number 
of common genes. Table [3] shows the value tAB for every pair of the three 
methods. We can see that majority of genes in a cluster agrees to another 
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Empirical CDF 


Empirical CDF 




Figure 5: Cumulative distribution functions of confidence values for cluster¬ 
ing. The two clusters (left) are reliable, while the three clusters (right) are 
unstable. 


(a) TH=0.90 


A\B 

direct KPCA 

t-SNE 

N a 

Direct 

0.917 

0.800 

1207 

KPCA 

0.912 

0.757 

1135 

t-SNE 

0.812 0.785 

- 

1284 


(b) TH = 0.95 


A\B 

direct KPCA 

t-SNE 

N a 

Direct 

0.896 

0.786 

1180 

KPCA 

0.886 

0.712 

1081 

t-SNE 

0.803 0.757 

- 

1280 


Table 3: Agreement of clusters among the three methods for the normalized 
cut. The rightmost column shows the number of selected genes for each 
method ( Na ). 


cluster given by a different method. This confirms that the clustering reveals 
the structure of the data. KPCA Ncut and t-SNE Ncut are slightly less 
consistent, which may be caused by the difference of Na for the two methods. 

Finally we conducted the phylogenetic analysis on the clusters of gene 
trees. For each clustering method (direct Ncut, KPCA Ncut, and t-SNE 
Ncut), we have reconstructed a consensus tree from each cluster. To con¬ 
struct the consensus tree, we have used the gene trees in each cluster with 
bootstrap value greater than 0.95 and took the majority rule with more 
than 50% for reconstructing the consensus tree for resolving each split on 
the tree. With all the clustering methods, the result suggests that there 
are two clusters in the genome-wide data set on coelacanths, lungfishes, and 
tetrapods: the number of genes are (858,322), (761,320), and (817,463) for 
direct Ncut, KPCA Ncut and t-SNE Ncut, respectively. Note that we used 
only gene trees with bootstrap confidence > 0.95. 
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With all of the three methods, direct Ncut, Ncut with KPCA, and Ncut 
with t-SNE, one cluster of the gene trees provides the tree topology Tree 4 
from Figure 1 in |Liang et al\ (2013), while the other cluster gives the tree 
topology Tree 2 from Figure 1 in Liang et al. (2013) (see Fig. [6]). 

We have also reconstructed a tree from each cluster by concatenating the 
alignments using the software PhyloBayes 3.3 under a mixture model CAT 
+r4 with two independent MCMC runs for 10,000 cycles. However, we did 
not observe any difference in the tree topologies, i.e., the reconstructed trees 


have all the same tree topology as Tree 1 in from Figure 1 in Liang et al. 
(2013) (see Fig. [if. 


4 Discussion 


In this paper we have shown three main results: the Ncut clustering al¬ 
gorithm works well on the gene trees reconstructed via the NJp under the 
evolutionary models; via the Ncut clustering algorithm we found two clusters 


on the genome data sets from Liang et al. (2013); last, fc-means performs 


equally well after dimension reduction, while hierarchical clustering is always 
outperformed in this context. 

Simulations: As we have shown by simulations in Section [3j the normalized 
cut framework works effectively on the set of gene trees reconstructed via 
the NJp method compared to the trees reconstructed via the MLE under 
the evolutionary models (Table i Fig. a as well as Fig. [2] and [3] for 
the other methods). It is not clear why this phenomenon appears in our 
computational study and it is of interest to investigate mathematically the 
reason it happens. 

Coelacanths, lungfishes, and tetrapods data set: Using the Ncut algorithm on 



Figure 6: The majority rule consensus tree consists of gene trees with more 
than 0.95 bootstrap values in each cluster. Each split in the trees is resolved 
only if we have majority, i.e. 50% of all given gene trees in each set agree. 
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the gene trees reconstructed via the NJp method, we were able to identify 
two clusters. Bootstrap confidence analysis suggests that these are two 
reliable clusters and it appears to be very unlikely to have more than two 
clusters (see Fig. [5j). From the two clusters we were able to find using the 
Ncut framework, we have reconstructed the consensus trees and their tree 
topologies did not support the hypothesis that the lungfishes are the closest 
living relatives of the tetrapods as in Liang et al. (2013), but supported the 
hypotheses that the coelacanths are most closely related to the tetrapods, 
and that the coelacanth, lungfish, and tetrapod lineages diverged within a 
very short time interval. Since clustering analysis with Ncut does not infer 
any evolutionary events that caused the clusters, it would be interesting 
and important to further investigate how these clusters were made in the 
evolutionary history. 
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Figure 7: The reconstructed trees obtained by concatenating the alignments 
from each cluster after using direct Ncut. For this result, we employed 
the Bayesian inference using the software PhyloBayes 3.3 under a mixture 
model CAT +T4 with two independent MCMC runs for 10,000 cycles. 
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